Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add and update Javadoc on realtime classes #5585

Merged
merged 27 commits into from
May 23, 2024
Merged

Conversation

abyrd
Copy link
Member

@abyrd abyrd commented Dec 19, 2023

This PR consists almost entirely of changes to Markdown documentation and Javadoc on classes associated with realtime updates. Many of the new comments include questions to be resolved. The overall intent is to make the code more approachable for people first encountering it or returning to it. This is strongly related to #5365.

As I was getting into working on #4002 and #4718 (refactoring transit data models and realtime), I was finding it difficult to understand what many of the classes represented or did, how they fit together into separate models (e.g. general internal transit representation vs. raptor-specific), what the relationships of those models were to one another, how and when conversions happen, how and why the GTFS and SIRI versions differed from one another etc. Several aspects of the transit data and realtime handling have been heavily modified since I last worked on them, and some of these changes don't seem to be well documented. My original intent for how certain classes (the updaters etc.) are used was also not clearly expressed in the Javadoc or Markdown documentation. So I am progressively adding and updating the documentation as I investigate and see things that would benefit from more clarity or changes.

There are a couple of small non-documentation changes included here, but they're pure refactors and included in commits separate from the documentation commits.

@abyrd abyrd force-pushed the abyrd/realtime-javadoc branch from b2b45a2 to 710b678 Compare January 24, 2024 12:13
Copy link

codecov bot commented Jan 24, 2024

Codecov Report

Attention: Patch coverage is 40.54054% with 22 lines in your changes are missing coverage. Please review.

Project coverage is 67.91%. Comparing base (c3147f5) to head (df3361f).
Report is 34 commits behind head on dev-2.x.

Files Patch % Lines
...opentripplanner/ext/siri/SiriTripPatternCache.java 0.00% 11 Missing ⚠️
...pentripplanner/ext/siri/updater/SiriSXUpdater.java 0.00% 2 Missing ⚠️
...er/ext/siri/updater/SiriETGooglePubsubUpdater.java 0.00% 1 Missing ⚠️
...pentripplanner/ext/siri/updater/SiriETUpdater.java 0.00% 1 Missing ⚠️
...t/siri/updater/azure/AbstractAzureSiriUpdater.java 0.00% 1 Missing ⚠️
...outing/impl/DelegatingTransitAlertServiceImpl.java 0.00% 1 Missing ⚠️
...tripplanner/transit/model/network/TripPattern.java 50.00% 1 Missing ⚠️
...anner/updater/alert/GtfsRealtimeAlertsUpdater.java 0.00% 1 Missing ⚠️
...pplanner/updater/trip/MqttGtfsRealtimeUpdater.java 0.00% 1 Missing ⚠️
...entripplanner/updater/trip/PollingTripUpdater.java 0.00% 1 Missing ⚠️
... and 1 more
Additional details and impacted files
@@              Coverage Diff              @@
##             dev-2.x    #5585      +/-   ##
=============================================
+ Coverage      67.89%   67.91%   +0.02%     
- Complexity     16560    16589      +29     
=============================================
  Files           1911     1913       +2     
  Lines          72465    72629     +164     
  Branches        7446     7451       +5     
=============================================
+ Hits           49201    49327     +126     
- Misses         20744    20773      +29     
- Partials        2520     2529       +9     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@t2gran t2gran added this to the 2.6 (next release) milestone Mar 12, 2024
Uniformly add RT_AB to all TODO and FIXME comments
@abyrd
Copy link
Member Author

abyrd commented Mar 12, 2024

I have re-read, cleaned up and finalized this whole batch of comments and clarifications on RT-related classes. These comments imply or suggest a lot of other work, but that can be decided on and carried out in later PRs.

There is now a detailed package.md file in src/main/java/org/opentripplanner/updater/package.md with an SVG illustration. This is rather long and will not show up in the PR diff by default so I'd suggest browsing to it directly to review it:
https://github.com/opentripplanner/OpenTripPlanner/blob/abyrd/realtime-javadoc/src/main/java/org/opentripplanner/updater/package.md

The only non-comment code changes should be in TripPatternCache and SiriTripPatternCache. I combined a few trailing builder method calls into one long chain for readability. I also renamed the logger instance with an uppercase identifier because it's static and final.

All TODO and FIXME items introduced in this PR should now include RT_AB to make them identifiable using the IntelliJ TODO filters (\btodo rt_ab\b.*).

@abyrd abyrd marked this pull request as ready for review March 12, 2024 14:29
@abyrd abyrd requested a review from a team as a code owner March 12, 2024 14:29
@abyrd abyrd requested review from t2gran and vpaturet March 12, 2024 14:29
@abyrd
Copy link
Member Author

abyrd commented Mar 14, 2024

I removed a bit of unused draft code, the enum TimetableSnapshot.TimetableSnapshotState to capture the sequence of four states a snapshot can pass through that are currently encoded with two separate boolean fields readOnly and dirty. This enum was not used but served as documentation of how this class is intended to be used, and for an intended refactor. I moved it to a new branch abyrd/realtime-rework.

private final SiriTripPatternIdGenerator tripPatternIdGenerator;

// TODO RT_AB: clarify name and add documentation to this field, and why it's constructor injected
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe that it's injected so that it's easier to test, because you can write a very simple mock function rather than a full blown implementation of TransitService.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Constructor injection is the recommended way to inject services - no need to comment, if not it should be commented.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like it's only used in one place, and I didn't spot any uses in testing. But maybe I missed something, or it's done this way to facilitate potential future testing. There may also have been a judgement that TransitService was just too broad as the SiriTripPatternCache only needed access to this one function. In which case this is a sort of anonymous sub-interface to better encapsulate the workings of TransitService.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is about decoupling, a low level thing should not point to a high level thing because that would create cyclic dependencies. It make it difficult test, but also to reason about. In general I am fan of creating a Functonal interface in these situation witch can be used to document the feature. There is no need for the TransitService to implement this interface, the injector can do the mapping.

@@ -32,15 +32,15 @@
import org.slf4j.LoggerFactory;

/**
* Maps the TransitLayer object from the OTP Graph object. The ServiceDay hierarchy is reversed,
* Maps the TransitLayer object from the TransitModel object. The ServiceDay hierarchy is reversed,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Maps the TransitLayer object from the TransitModel object. The ServiceDay hierarchy is reversed,
* Maps the {@link TransitLayer} object from the {@link TransitModel} object. The ServiceDay hierarchy is reversed,

I prefer the linked version as this changes the docs here, when the class is renamed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IntelliJ has an option to extend rename operations to strings, comments etc. Where class names are clearly distinct from any plain text (via use of CamelCase) it may be sufficient to use the names.

Screenshot 2024-03-31 at 14 29 15

This may be a question of style and uniformity more than functionality, since an IDE can readily identify all instances of the class name. There are about 19 references in comments without the link notation. If we change some of them we should probably change all of them. Extending that to every class name in every comment might arguably lead to a decrease in readability or maintainability.

Screenshot 2024-03-31 at 14 30 28

Of course if it is already a policy to progressively introduce link in any new comments then we should update these.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is discussed in the dev-meeting. There are pros/cons for both approches. Using the linking is good because it provide hover JavaDoc, but doing it the first time a class is mentioned is enough. Also, in this case the coupling is so strong that it it probably just noice. When refactoring we should always check stings as well. Note! Avoid using link if it introduce an illegal dependency (e.g. linking to the OTP internal transit model from Raptor).

Tip! I use LiveTemplates to speed up typing: co-> {@code } and li->{@link }. (Example template: {@link $SELECTION$$END$})

Copy link
Member

@leonardehrenfried leonardehrenfried left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this important work. I have left a few comments that I think you could think about.

@abyrd
Copy link
Member Author

abyrd commented Mar 31, 2024

Updates have been made in response to all comments / change requests, except for #5585 (comment) awaiting further input.

@leonardehrenfried leonardehrenfried self-requested a review April 2, 2024 08:42
Copy link
Member

@t2gran t2gran left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very good @abyrd ! Some of my comments may be a bit hash, but I did not have the time to find a better way of expressing my self - apologise for that.

One thing sticks out - there are a lot of JavaDoc on private fields - it is ok with short implementation notes on these, but in general it should be avoided. Instead comments should be on the public accessors(get/or set if get does not exist). Also, if the comment is about behaviour - it belongs on a method inside the same class or on the type of the field. You don't need to fix this, but keep it in mind for the future - we will probably itterate over this several times and in the end, hopefully have the doc in the right place. This is also about being DRY - witch apply to doc as well as code.

*/
public class SiriAlertsUpdateHandler {

private static final Logger LOG = LoggerFactory.getLogger(SiriAlertsUpdateHandler.class);
private final String feedId;
private final Set<TransitAlert> alerts = new HashSet<>();
private final TransitAlertService transitAlertService;
/** How long before the posted start of an event it should be displayed to users */

/** The alert should be displayed to users this long before the activePeriod begins. */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Java doc should not be added to private fields, but the get (or set) accessor.

Copy link
Member Author

@abyrd abyrd May 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case, the private field can only be set via the constructor. The comment can be moved to the constructor but then for a maintainer working on the internals of this class to understand what the earlyStart field does, they'd need to find that it's set in the constructor, read the constructor javadoc, and confirm that the documented constructor parameter sets this field.

I get the idea of wanting the documentation concentrated on public methods, but I think it depends whether the documentation is intended for developers working on this code, or for developers making use of the code. In a case like this I'd be inclined to duplicate the documentation - maintenance can cross a line to being very tedious when the reader has to go searching for the meaning of every field.

Some DRY could be achieved by linking the private field Javadoc to the only place it's set, but this requires a verbose approach that is just as long as the comment explaining the purpose of the field, and much more fragile/full of irrelevant detail:
/** @see #SiriAlertsUpdateHandler(String, TransitModel, TransitAlertService, SiriFuzzyTripMatcher, Duration) */

I also find there's something non-generalizable about a convention of documenting on an accessor method because the relationship from fields to methods that mutate them is one-to-many. There's not one unambiguous place to put the documentation (on a setter, a factory method, one of several constructors?) but there is a single field they're all referencing.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved it to an @see annotation on the constructor javadoc.

* Creates alert from PtSituation with all textual content
* Creates a builder for an internal model TransitAlert, including all textual content from the
* supplied SIRI PtSituation. The feed scoped ID of this TransitAlert will be the single feed ID
* associated with this update handler, plus the situation number provided in the SIRI feed.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The feed scoped ID of this TransitAlert will be the single feed ID
associated with this update handler, plus the situation number provided in the SIRI feed.

The strategy for handling IDs should be documented in a more visible place than on a private method. I also think we probably need to allow more than one strategy - so it should be extracted so it is pluggable. This is probably one of the main things we need to fix to more this from the Sandbox to the main code.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From a maintenance perspective, this may be a good place to document the structure of the IDs because they are created during this mapping operation. Although I value DRY, the same concern exists here as other places above: increasing visibility of docs for people reusing/configuring the encapsulated code might make it hard for the maintainer to find those descriptions and use them to quickly understand the actions of this method.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have also put the information in the class-level Javadoc, and rephrased the information for the two contexts. Also included a note that FeedScopedId creation strategy should be pluggable or configurable as a condition for moving out of sandbox.


An OTP instance can have multiple sources of realtime data at once. In some cases the transit data includes several feeds of scheduled data from different providers, with one or more types of realtime updates for those different feeds.

In a large production system, ideally all the scheduled data would be integrated into a single data source with a unified ID namespace, and all the realtime data would also be integrated into a single data source with an exactly matching namespace. This would be the responsibility of a separate piece of software outside (upstream of) OTP. In practice, such an upstream data integration system does not exist in many deployments and OTP must manage several static and realtime data sources at once. Even when data feeds are well-integrated, the different kinds of realtime (arrival time updates, vehicle positions, or text alerts) may be split across multiple feeds as described in the GTFS-RT spec, which implies polling three different files.
Copy link
Member

@t2gran t2gran Apr 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In a large production system, ideally all the scheduled data would be integrated into a single data source with a unified ID namespace, and all the realtime data would also be integrated into a single data source with an exactly matching namespace.

I disagree, it does not scale - e.g. putting all transit data for the world into the same source is not going to work. We have done it in Norway, and there are many advatiges - but there are also many problems(scaling, performance, resource allocation, cost). But I do agree that IDs should be unique - this requirment can be met by convention. Currently OTP add a feed prefix to ids to make them unique inside OTP, but this is not optimal because IDs should be unique and the same across whole system and in TIME. OTP only have limited support for entity versioning of NeTEx quays and StopPlaces.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't considering a whole-world case, where many other things about OTP also wouldn't scale. I was considering current typical OTP instances, where the largest ones are usually for single countries like Norway, Germany, or France and many deployments are much smaller.

Here I was trying to express something that I heard in past discussion: Certain responsibilities are outside the scope of OTP, and I was under the impression most of us considered a data integration pipeline to be an external service upstream of OTP. Similar to how realtime prediction is also seen as ideally outside and upstream of OTP.

If this is not the case, and integration of multiple data sources within OTP itself is considered advantageous (rather than just a convenience feature) I would like to document why and how. Is the key fact that some data sources update more frequently than others, and they update at different times? So rather than producing and ingesting the whole integrated data feed each time one of the constituent feeds changed, are you thinking of live OTP instances live-patching large chunks of their internal data? Basically, entire feeds would be replaced in the same way as realtime messages are currently applied? If this is the case we should begin documenting these ideas in the realtime working group doc.

On ID namespaces: If all the IDs across many feeds are guaranteed to be unique, in some sense they are already using a unified namespace. The "upstream integration system" I mentioned doesn't have to be merging the feeds - it could also just be data consistency and referential integrity checks for example.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not think we disagree here, I am not talking about "live-patching large chunks of their internal data". If you do a single netex data source or multiple files it both comes with a set of challenges you need to deal with - one way is not better than the other - in my eyes. I also think that adding a feedScope to the ID should be an opt in feature - the id is changed. It divide the ecosystem in "downstream" and "upstream" of OTP.

@abyrd
Copy link
Member Author

abyrd commented May 3, 2024

@t2gran all your PR comments have been taken into account, except package.md. I should have that one finished in the next couple of days, and this should be ready for merge soon.

@abyrd
Copy link
Member Author

abyrd commented May 4, 2024

@t2gran I have made updates to the PR in response to most of your comments. The last few comments on package.md focusing on implementation strategy will be copied to the realtime working group document for use in planning the refactor. I consider this PR ready for final sign-off and merge.

Edit: the implementation ideas have now been copied over to the working group doc.

@abyrd abyrd requested a review from t2gran May 4, 2024 16:29
private final SiriTripPatternIdGenerator tripPatternIdGenerator;

// TODO RT_AB: clarify name and add documentation to this field, and why it's constructor injected
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is about decoupling, a low level thing should not point to a high level thing because that would create cyclic dependencies. It make it difficult test, but also to reason about. In general I am fan of creating a Functonal interface in these situation witch can be used to document the feature. There is no need for the TransitService to implement this interface, the injector can do the mapping.

@@ -94,6 +94,7 @@ StopPatternBuilder mutate() {
return new StopPatternBuilder(this, null);
}

// TODO RT_AB: documentation or naming - this does not mutate the object in place, it makes a copy
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

* two layers (one for the streets and one for the public transit data). Here the situation is
* different: this seems to be an indexed and rearranged copy of the main transit data (a
* TransitModel instance).
*/
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is good to know where the "Layer" comes from, after working with OTP for many years I did not know. I think we should stick to industry/de facto standards and not invent over own when it comes to architecture naming(not transit domain). I can recommend Domain Driven Design (E.Evens) and Applying UML and Patterns: An Introduction to Object-Oriented Analysis and Design and the Unified Process (C.Larman). These books are a bit old, so not sure what they say on n-tier architecture (layered architecture) - In DDD inversion of control i used to avoid the classic n-tier - to protect the domain logic/main asset. This is useful(or necessary) in complex components like OTP.

* two layers (one for the streets and one for the public transit data). Here the situation is
* different: this seems to be an indexed and rearranged copy of the main transit data (a
* TransitModel instance).
*/
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Model responsibilites:

  • manage and encapsulate aggreagates, entities and domain data (p.s. data not logic)
  • Models store data somewhere (usually via a persistance-layer(DB), but in OTP in we store it in memory andby serializing it to disk).
  • In OTP the model serve to groups of clients: Those who read data (API and route engine) and those who read/write data(graphBuiler/updaters). These two type og clients should get access to ONLY the features they need - without access to e.g. writing. Since OTP is all about routing, the model should be tailored for that purpose. If something is getting in the way, it should be extracted and stored somewhere else - maybe in its own model.
  • data integrity - enforcing business rules related to data.
  • support for use-case specific functionality, but not responsible for it.
  • support for none-functional requirements.

The model is not responsible for:

  • Use-case specific rules and data structures. If use-case specific features become "noice" or introduce complexity, then they are best kept outside the model.


An OTP instance can have multiple sources of realtime data at once. In some cases the transit data includes several feeds of scheduled data from different providers, with one or more types of realtime updates for those different feeds.

In a large production system, ideally all the scheduled data would be integrated into a single data source with a unified ID namespace, and all the realtime data would also be integrated into a single data source with an exactly matching namespace. This would be the responsibility of a separate piece of software outside (upstream of) OTP. In practice, such an upstream data integration system does not exist in many deployments and OTP must manage several static and realtime data sources at once. Even when data feeds are well-integrated, the different kinds of realtime (arrival time updates, vehicle positions, or text alerts) may be split across multiple feeds as described in the GTFS-RT spec, which implies polling three different files.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not think we disagree here, I am not talking about "live-patching large chunks of their internal data". If you do a single netex data source or multiple files it both comes with a set of challenges you need to deal with - one way is not better than the other - in my eyes. I also think that adding a feedScope to the ID should be an opt in feature - the id is changed. It divide the ecosystem in "downstream" and "upstream" of OTP.

@abyrd
Copy link
Member Author

abyrd commented May 17, 2024

I updated several comments in response to clarifications from @t2gran, and applied his suggested commits. Ready for re-review.

@abyrd
Copy link
Member Author

abyrd commented May 17, 2024

@t2gran responding to #5585 (comment) here (to avoid it getting lost in the chain): Thanks for the references and details. These points fit with my understanding of what the model is. The remaining question for me is about there being one specific concrete class called Model, which is itself part of the model. Reading up on DDD, it looks like the relevant DDD concept is an "aggregate root". However, I still find it disorienting in this one specific case where the noun is "Model" (as opposed to "Network" or any other noun) because it's implying a set is a member of itself. But this super general question is outside the scope of this PR and we can talk about it some other time.

@abyrd abyrd requested review from t2gran and leonardehrenfried May 17, 2024 10:32
@abyrd abyrd merged commit 1a3ed74 into dev-2.x May 23, 2024
6 checks passed
@abyrd abyrd deleted the abyrd/realtime-javadoc branch May 23, 2024 16:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants